# Cross-modal Pretraining

| Model | License | Description | Tags | Publisher | Downloads | Likes |
|-------|---------|-------------|------|-----------|----------:|------:|
| Vit Large Patch16 Siglip 512.v2 Webli | Apache-2.0 | ViT image encoder based on SigLIP 2, designed for timm and suitable for vision-language tasks | Image Classification · Transformers | timm | 295 | 0 |
| Vit Base Patch16 Siglip 256.webli I18n | Apache-2.0 | ViT-B/16 vision Transformer based on SigLIP, containing only the image encoder and using the original attention pooling | Image Classification · Transformers | timm | 16 | 0 |
| Speecht5 Tts Hr | MIT | SpeechT5 text-to-speech model fine-tuned for Croatian, built on Microsoft's SpeechT5 architecture and trained on the VoxPopuli dataset | Speech Synthesis · Transformers · Other | nikolab | 124 | 1 |
| Speecht5 Asr | MIT | SpeechT5 automatic speech recognition model fine-tuned on the LibriSpeech dataset, supporting speech-to-text conversion | Speech Recognition · Transformers | microsoft | 12.30k | 41 |
| Xclip Base Patch16 Hmdb 8 Shot | MIT | X-CLIP, an extension of CLIP for general video-language understanding, trained via contrastive learning on video-text pairs; suitable for video classification and video-text retrieval | Text-to-Video · Transformers · English | microsoft | 17 | 1 |
| Unixcoder Base Nine | Apache-2.0 | UniXcoder, a unified cross-modal pretrained model that leverages multimodal data such as code comments and abstract syntax trees to pretrain code representations | Multimodal Fusion · Transformers · English | microsoft | 17.35k | 19 |
| Unixcoder Base | Apache-2.0 | UniXcoder, a unified cross-modal pretrained model that leverages multimodal data such as code comments and abstract syntax trees to pretrain code representations | Multimodal Fusion · Transformers · English | microsoft | 347.45k | 51 |
© 2025 AIbase